Topic Modeling of Hierarchical Corpora

Author

  • Do-kyum Kim
Abstract

We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models we show that there exists a simple variational approximation for probabilistic inference. The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model’s hierarchy. We compare our approach to existing implementations of nonparametric HDPs. On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods. Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security—one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy.
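To make the setting concrete, here is a minimal sketch (not taken from the paper) of one way topic proportions with a fixed number of topics can be coupled across adjacent levels of a hierarchy: each node's proportions are drawn from a Dirichlet distribution centered on its parent's proportions. The function names, the `concentration` parameter, the uniform root, and the fixed branching factor are all illustrative assumptions, and the sketch covers only forward sampling, not the variational inference the abstract describes.

```python
import numpy as np


def sample_child_proportions(parent_theta, concentration, rng):
    """Draw a child's topic proportions from a Dirichlet centered on the parent's.

    Assumption: the child is distributed as Dirichlet(concentration * parent_theta),
    so a larger concentration keeps the child closer to its parent.
    """
    return rng.dirichlet(concentration * parent_theta)


def sample_hierarchy(depth, branching, num_topics, concentration, seed=0):
    """Sample topic proportions for every node of a balanced tree, level by level."""
    rng = np.random.default_rng(seed)
    # Assumption: the root uses uniform proportions over a fixed number of topics.
    root = np.full(num_topics, 1.0 / num_topics)
    levels = [[root]]
    for _ in range(depth):
        children = [
            sample_child_proportions(parent, concentration, rng)
            for parent in levels[-1]
            for _ in range(branching)
        ]
        levels.append(children)
    return levels


# Hypothetical sizes: a two-level hierarchy below the root, three children per node.
levels = sample_hierarchy(depth=2, branching=3, num_topics=5, concentration=50.0)
for depth_index, nodes in enumerate(levels):
    print(depth_index, len(nodes))
```

Because each child's Dirichlet parameter depends on its parent's sampled proportions, the latent variables in adjacent levels are conditionally dependent, which is the kind of coupling an inference procedure for such a model has to handle.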

Similar resources

A Variational Approximation for Topic Modeling of Hierarchical Corpora

We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models we show that t...

HDPsent: Incorporation of Latent Dirichlet Allocation for Aspect-Level Sentiment into Hierarchical Dirichlet Process-Based Topic Models

We address the problem of combining topic modeling with sentiment analysis within a generative model. While the Hierarchical Dirichlet Process (HDP) has seen recent widespread use for topic modeling alone, most current hybrid models for concurrent inference of sentiments and topics are not based on HDP. In this paper, we present HDPsent, a new model which incorporates Latent Dirichlet Allocatio...

Topic Model Stability for Hierarchical Summarization

We envisioned responsive generic hierarchical text summarization with summaries organized by topic and paragraph based on hierarchical structure topic models. But we had to be sure that topic models were stable for the sampled corpora. To that end we developed a methodology for aligning multiple hierarchical structure topic models run over the same corpus under similar conditions, calculating a...

Modeling corpora of timestamped documents using semisupervised nonparametric topic models

In this paper we propose a nonparametric topic model to capture the evolution of text over time. Mixture models for modeling text documents based on hierarchical Dirichlet processes (HDP) have been used successfully in recent work to provide a nonparametric prior for the number of topics in the corpus, eliminating the need to specify a priori the number of topics. We extend this model to addition...

The Discrete Infinite Logistic Normal Distribution for Mixed-Membership Modeling

We present the discrete infinite logistic normal distribution (DILN, “Dylan”), a Bayesian nonparametric prior for mixed membership models. DILN is a generalization of the hierarchical Dirichlet process (HDP) that models correlation structure between the weights of the atoms at the group level. We derive a representation of DILN as a normalized collection of gamma-distributed random variables, a...

Journal:
  • CoRR

Volume: abs/1409.3518

Publication date: 2014